1. INTRODUCTION

Recently, machine learning (ML) and artificial intelligence (AI) have advanced quickly in various 
domains, including text, image, speech and games. In these domains, machines perform as well as or better 
than humans [1]-[6]. The increase in the size of training data is a key driver of this improvement. The two ways to 
obtain larger training sets are i) acquiring new original data or ii) employing data augmentation techniques. 
Data augmentation is a method to increase the amount of data for training models without collecting new 
data. For example, in the case of image processing, a photo of an orange could be used to generate thousands 
of different orange images by various image processing techniques such as rotating, flipping, and blurring. 
Data augmentation has been proven fruitful in improving model performance [7]. 

However, there has been no analogous progress in prediction performance in studies of stock 
prediction. For example, economists could not predict the S&P 500's 40 percent plunge in March 2020 and the 
subsequent new high in August 2020. Unlike in other domains, the age of big data has not significantly 
increased the size of training sets in stock prediction studies. In other domains such as image classification, the 
number of images available for training has increased exponentially in recent years. In contrast, the number of 
stock series has remained roughly the same. Moreover, well-accepted data augmentation techniques for stock price 
series do not exist [8], [9]. Few data augmentation techniques for stock data have been proposed. Existing 
studies apply data augmentation techniques based on signal processing to stock price prediction [10], [11]. 
However, stock prices are different from physical waves, so stock data augmentation based on signal 
processing has no solid economic foundation. To our knowledge, this study is the first data augmentation 
study on stock return prediction with a solid economic foundation. Unlike existing studies, our data 
augmentation is sensible and mimics actual financial asset creation: new augmented assets and price series 
are generated from linear combinations of existing assets/stocks. 

The paper is organized as follows. Section 1 explains the data augmentation process. Section 2 proposes a 
data augmentation technique. Sections 3 and 4 apply the proposed data augmentation to stock return 
prediction and show that the data augmentation significantly improves prediction performance. In section 5, 
we study how the characteristics of the original data affect the performance of data augmentation. The final 
section concludes. 


2. DATA AUGMENTATION METHOD 
2.1. Generating new assets 

Table 1 lists well-accepted data augmentation techniques employed in various domains. In these domains, 
data augmentation is mostly based on invariant transformations. For example, image transformations 
such as rotation and reflection are used for image augmentation. In natural language processing (NLP), back 
translation and synonym replacement are used to enrich the training data and enhance training performance. In speech 
recognition and time-series classification, time warping is used for data augmentation. However, to our 
knowledge, there is no well-accepted data generating process for stock data augmentation in existing studies. 
We propose a sensible stock data augmentation that mimics actual financial asset creation processes. 


Table 1. Well-accepted data augmentation in various domains 

Data type/domain          Data augmentation process         Studies
Image                     Rotation, blurring, reflection    [12], [13]
Text                      Back translation, synonym         [14], [15]
Speech and time series    Time warping                      [16]-[19]
Stocks                    -                                 -


Data augmentation for stocks generates new stocks/assets and price series from existing stocks. In 
financial asset management, the most common way to create a new asset is by combining underlying assets 
with fixed weights. Well-known examples of assets constructed by weighting underlying assets are S&P 500 
index funds. An S&P 500 index fund is a market-capitalization-weighted portfolio of 505 U.S. stocks. The index 
accounts for 80% of the value of the U.S. stock market. Table 2 shows the five stocks with the highest weights in 
the S&P 500 index in September 2021. 


Table 2. Five stocks with the highest weights in the S&P 500 index 


Company Stock symbol Weight (%) 
Apple AAPL 6.2 
Microsoft MSFT 5.9 
Amazon AMZN 3.9 
Facebook FB 2.4 
Alphabet GOOGL 2.3 


Under competitive financial markets, the price of a new asset created by combining underlying 
assets is precisely the weighted average of the prices of the underlying assets. We can therefore synthesize a 
new augmented asset $x_j$ and its price series by combining underlying stocks $s_i$ with weights $w_{ij}$. Consequently, 
the price of this augmented asset is

$$P_{x_j} = \sum_{i=1}^{N} w_{ij} P_{s_i} \qquad (1)$$

where $P_{s_i}$ is the price of stock $i$ and $N$ is the number of underlying stocks/assets. The weights $w_{ij}$ are all positive. 
Throughout this paper, prices of augmented assets are constructed in this fashion. Figure 1 shows the price series 
of a new asset constructed from 0.5 Apple (AAPL) stock and 0.5 Microsoft (MSFT) stock. The price of the new asset 
is the midpoint between the prices of AAPL and MSFT stocks. 
Figure 1. New asset price series 
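
The fixed-weight combination described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `augmented_prices` is hypothetical, and the prices are made-up numbers rather than real AAPL or MSFT quotes.

```python
def augmented_prices(price_series, weights):
    """Return the price series of a fixed-weight combination of assets.

    price_series: list of equally long price lists, one per underlying stock.
    weights: one positive weight per stock.
    """
    if len(price_series) != len(weights):
        raise ValueError("need exactly one weight per price series")
    if any(w <= 0 for w in weights):
        raise ValueError("weights are assumed to be positive")
    n_days = len(price_series[0])
    if any(len(p) != n_days for p in price_series):
        raise ValueError("all price series must cover the same days")
    # Weighted sum of the underlying prices, evaluated day by day.
    return [
        sum(w * p[t] for w, p in zip(weights, price_series))
        for t in range(n_days)
    ]


# Made-up prices for illustration only (not real quotes).
aapl = [100.0, 102.0, 101.0]
msft = [200.0, 198.0, 205.0]
mix = augmented_prices([aapl, msft], [0.5, 0.5])
print(mix)  # [150.0, 150.0, 153.0]
```

With equal weights of 0.5, each synthetic price is the midpoint of the two underlying prices on that day, which is the situation Figure 1 depicts.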


Technically, the data augmentation in (1) is similar to the image data augmentation in [20]. 
In Zhang et al. [20], new images are created by a mixup process: averaging the pixels of two original images. 
Figure 2 shows a mixup of a dog image and a cat image. This image augmentation brings performance 
improvements on various datasets. Table 3 shows the differences between the mixup process and our data 
augmentation. Mixup is applied to image classification problems, whereas our data augmentation is for stock return 
regression problems. While mixup has no real-world analog, our data augmentation is based on real-world 
financial asset creation. The mixup process only applies to two original images, whereas our augmentation can 
apply to any number of original stocks greater than or equal to 2. Compared with mixup, our augmentation can 
therefore apply to many more combinations of original data points and could potentially generate much larger 
augmented datasets. 


Figure 2. Mixup image of a dog and a cat 


Table 3. Mixup and our data augmentation 

                                                   Mixup             Our data augmentation
Domain                                             Images            Time series of stock prices
Real-world foundation                              No                Yes
Type of problems                                   Classification    Regression
Number of data points combined for augmentation    2 images          Any k ≥ 2 stock series
Number of all possible combinations of original    N(N-1)/2          2^N - N - 1
data points for augmentation
(N is the number of original data points)
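
The two combination counts can be checked with a short script. The sketch below is only an illustration (the function names are hypothetical): it counts unordered pairs for mixup and subsets of at least two stocks for our augmentation, and verifies the closed forms by brute-force enumeration for a small N.

```python
from itertools import combinations
from math import comb


def mixup_pairs(n):
    """Number of unordered pairs of n original data points: N(N-1)/2."""
    return n * (n - 1) // 2


def stock_subsets(n):
    """Number of subsets containing at least two of n stocks: 2^N - N - 1."""
    return 2 ** n - n - 1


# Brute-force check for a small N by enumerating the subsets directly.
n = 5
enumerated = sum(1 for k in range(2, n + 1) for _ in combinations(range(n), k))
assert enumerated == stock_subsets(n) == 26
assert comb(n, 2) == mixup_pairs(n) == 10

print(mixup_pairs(10), stock_subsets(10))  # 45 1013
```

These figures count only which original data points are combined. Each subset of stocks can additionally be combined with many different weight vectors.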


2.2. Data, feature, and target 

The historical adjusted closing prices of each stock in the S&P 500 from 1 January 2002 to 31 March 2020 
are used in this study. Stocks in the S&P 500 are extensively used in studies on stock forecasting [21], [22]. The 
data set is separated into two parts: 1/1/2002-30/11/2016 and 1/1/2017-31/3/2020. The first part is used for 
training and the second for testing. 

The target variable is the 20-day forward return of each stock. The features employed for predicting 
future returns are the standard technical indicators shown in Table 4: rolling 
volatility, simple moving average normalized by the 5-day simple moving average, rolling median to mean, 
rolling standard deviation to mean, and rolling return. The sizes of the rolling windows are 10, 20, 40, 60, 80, 
120, and 240 days. In total, there are 35 features (5 indicators x 7 window sizes). All the features are generated 
from the price series of each stock. These features are commonly used in the stock forecasting literature [23]-[25]. 

Table 4. Feature and target 


Variable type Name Size of rolling windows (days) 
Target Forward return 20 days 
Feature Rolling volatility 10, 20, 40, 60, 80, 120, 240 days 
Feature Simple Moving Average 10, 20, 40, 60, 80, 120, 240 days 
Feature Rolling median to mean 10, 20, 40, 60, 80, 120, 240 days 
Feature Rolling standard deviation to mean 10, 20, 40, 60, 80, 120, 240 days 
Feature Rolling return 10, 20, 40, 60, 80, 120, 240 days 
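
The paper does not give formulas for the indicators in Table 4, so the following is only a sketch under stated assumptions: volatility is taken as the population standard deviation of daily returns, the moving average is divided by the 5-day moving average, and the median, standard deviation and return are computed on prices. All function and feature names are hypothetical.

```python
from statistics import mean, median, pstdev

WINDOWS = (10, 20, 40, 60, 80, 120, 240)
HORIZON = 20  # forward-return horizon in trading days


def forward_return(prices, t, horizon=HORIZON):
    """Target: return from day t to day t + horizon."""
    return prices[t + horizon] / prices[t] - 1.0


def features_at(prices, t):
    """Build 35 features for day t using prices up to day t only (t >= 240)."""
    daily = [prices[i] / prices[i - 1] - 1.0 for i in range(1, t + 1)]
    sma5 = mean(prices[t - 4 : t + 1])
    feats = {}
    for w in WINDOWS:
        window = prices[t - w + 1 : t + 1]  # last w prices, day t included
        # Assumed definitions; the source names the indicators only.
        feats[f"volatility_{w}"] = pstdev(daily[-w:])
        feats[f"sma_{w}"] = mean(window) / sma5
        feats[f"median_to_mean_{w}"] = median(window) / mean(window)
        feats[f"std_to_mean_{w}"] = pstdev(window) / mean(window)
        feats[f"return_{w}"] = prices[t] / prices[t - w] - 1.0
    return feats


# Synthetic series growing 0.1% per day, for illustration only.
prices = [100.0 * 1.001 ** i for i in range(300)]
feats = features_at(prices, 260)
print(len(feats))  # 35
```

Because every feature here is a ratio of prices or a statistic of returns, multiplying a price series by a positive constant leaves the features and the target unchanged.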

3. RESULTS 


This section studies the effect of data augmentation on prediction performance, measured by root 
mean square error (RMSE), using augmented data of various sizes. The baseline RMSEs 
with no data augmentation are reported in section 3.1. These RMSEs are then compared with those 
obtained with data augmentation in section 3.2. 


3.1. Baseline results without data augmentation 

We first show the baseline prediction performance without data augmentation. To focus on data 
augmentation, we use a standard ML algorithm in all trials. For this purpose, throughout the paper, the light 
gradient boosted machine (LightGBM) with default settings is employed with no hyper-parameter tuning. 
LightGBM is a well-established ML model developed and maintained by Microsoft. Our data augmentation 
strategy depends on the number of original stocks (N). To investigate how the data augmentation might be 
affected by N, we conduct trials with N=10, 20, 30, ..., 100. For each value of N, there are 30 trials. In each 
trial, N stocks are randomly picked from the S&P 500 stocks. The price series of the picked N stocks are then 
used as the primary data source for training and testing in that trial. With 10 values of N, there are 
10x30 = 300 trials in total. 
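
The trial design can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the identifiers in `universe` are placeholders for the S&P 500 constituents, the seed is arbitrary, and sampling without replacement within a trial is an assumption.

```python
import random

N_VALUES = range(10, 101, 10)  # N = 10, 20, ..., 100
TRIALS_PER_N = 30


def draw_trials(universe, seed=0):
    """Return a list of (N, picked_stocks) pairs, one per trial."""
    rng = random.Random(seed)
    trials = []
    for n in N_VALUES:
        for _ in range(TRIALS_PER_N):
            # Assumption: stocks are sampled without replacement in a trial.
            trials.append((n, rng.sample(universe, n)))
    return trials


# Placeholder identifiers standing in for the S&P 500 constituents.
universe = [f"STOCK_{i:03d}" for i in range(505)]
trials = draw_trials(universe)
print(len(trials))  # 300
```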

The mean and standard deviation (SD) of the test-set RMSEs of the trials for each value of N are shown in 
Table 5. As expected, the RMSEs and their standard deviations tend to decrease as N, and hence the size of the 
training sets, increases. The decrease in the mean and SD of RMSEs indicates that larger datasets give 
better prediction performance. 


Table 5. Baseline RMSEs 


Number of original stocks (N) Mean of test RMSEs SD of test RMSEs 
10 0.0921 0.0088 
20 0.0909 0.0056 
30 0.0904 0.0056 
40 0.0897 0.0048 
50 0.0898 0.0040 
60 0.0896 0.0035 
70 0.0894 0.0028 
80 0.0896 0.0031 
90 0.0896 0.0028 
100 0.0896 0.0026 


3.2. Data augmentation results 

We now study learning results with data augmentation. As shown in Figure 3, in each of the 10x30 
trials with N original stocks, kN additional synthetic assets and their data are generated and used in training. 
We experiment with three values of k: 5, 10, and 20. Therefore, with data augmentation, there are 10x30x3 = 900 trials. A 
synthetic price series is created from a random weighted sum of the N original price series. Each weight is 
drawn from a standard uniform distribution. After the data augmentation, the 
LightGBM model is trained using the augmented data. The trained model is then used to predict the forward 
returns of the N original stocks in each trial. 
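
A minimal sketch of this generation step is given below. It is not the authors' implementation: the function name `augment` and the seed are hypothetical, the toy prices are made up, and the weights are left unnormalized because the text does not say whether they are rescaled to sum to one.

```python
import random


def augment(price_series, k, seed=0):
    """Create k * N synthetic price series from N original ones."""
    rng = random.Random(seed)
    n = len(price_series)
    n_days = len(price_series[0])
    synthetic = []
    for _ in range(k * n):
        # One weight per original stock, drawn from the standard uniform
        # distribution (random() returns a float in [0.0, 1.0)).
        weights = [rng.random() for _ in range(n)]
        synthetic.append([
            sum(w * p[t] for w, p in zip(weights, price_series))
            for t in range(n_days)
        ])
    return synthetic


# Toy prices for three stocks over three days, for illustration only.
originals = [[10.0, 11.0, 12.0], [20.0, 19.0, 21.0], [5.0, 5.5, 5.0]]
synthetic = augment(originals, k=5)
print(len(synthetic))  # 15
```

Rescaling all weights of a synthetic asset by the same positive factor rescales its whole price series by that factor and leaves its returns unchanged, so normalizing the weights would not alter return-based quantities.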

Table 6 reports the average percentage decreases in RMSEs of the models trained with 
augmented data compared to those trained without augmented data, with trials grouped by the number of original 
stocks (N) and augmentation ratio (k). All the percentage decreases in Table 6 are positive; the prediction 
accuracy improves in all cases. We apply t-tests to test whether the decreases in RMSEs are different from 
zero. In all cases, the decreases in RMSEs differ from zero with high statistical significance. Our data augmentation 
significantly improves prediction performance for all values of N and k. 
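
The quantities behind Table 6 can be illustrated as follows. The paper does not state which t-test is used, so the one-sample t statistic on per-trial percentage decreases is an assumption, and the RMSE pairs are hypothetical numbers, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def rmse(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    return sqrt(mean((a - b) ** 2 for a, b in zip(y_true, y_pred)))


def pct_decrease(rmse_baseline, rmse_augmented):
    """Percentage drop in RMSE relative to the baseline model."""
    return 100.0 * (rmse_baseline - rmse_augmented) / rmse_baseline


def t_statistic(samples):
    """One-sample t statistic for H0: the mean of samples is zero."""
    return mean(samples) / (stdev(samples) / sqrt(len(samples)))


# Hypothetical per-trial (baseline, augmented) RMSEs, not paper results.
pairs = [(0.0920, 0.0902), (0.0915, 0.0900), (0.0930, 0.0909), (0.0908, 0.0893)]
drops = [pct_decrease(b, a) for b, a in pairs]
print(round(mean(drops), 3), round(t_statistic(drops), 2))
```

A positive mean decrease with a large t statistic corresponds to the starred entries in Table 6.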

As k increases, the decreases in RMSEs become larger: the more additional data generated by data 
augmentation, the higher the prediction accuracy. Note also that the decreases in RMSEs are larger in training sets 
with small N; data augmentation improves learning performance more when training with small datasets. These 
results show that the performance improvement from data augmentation is robust and behaves as expected. 


[Diagram: N original series; random-weight linear combination; kN augmented series; machine learning]


Figure 3. Data augmentation framework 


Table 6. Percentage decrease of RMSEs in test set 


Number of stocks (N)    Augmentation ratio (k)
                        5           10          20
10                      1.977***    2.527***    2.974***
20                      0.041***    0.904***    1.411***
30                      0.594***    0.959***    1.250***
40                      0.508***    0.748***    1.095***
50                      0.484***    0.800***    1.110***
60                      0.451***    0.684***    0.967***
70                      0.441***    0.652***    0.925***
80                      0.475***    0.713***    0.888***
90                      0.389***    0.630***    0.900***
100                     0.415***    0.649***    0.977***


Significance levels: *p<0.1; **p<0.05; ***p<0.01 


3.3. Return correlation and data augmentation performance 

In this section, we investigate how the augmentation performance is affected by the characteristics 
of the original stocks. The characteristic of interest is the average correlation of original stock returns in each trial. 
From a financial perspective, correlations of stock returns are one of the most important indicators of a 
portfolio. From an ML perspective, because augmented series are generated by averaging the original stock 
series, we expect that the similarity or difference of the original series could affect the augmentation performance. 
For example, in an extreme case in which all original series are constant and unrelated, augmenting data by 
averaging the original series obviously could not improve the learning performance. Figure 4 shows the scatter 
plot of the drop in RMSEs against the average pairwise correlation of the returns of the original N stocks in the 
training set of each trial. The plot suggests a U-shaped relationship between return correlation and the 
drop in RMSEs: large drops in RMSEs occur in the regions of high and low return correlation. 

To verify the U-shaped relationship, we estimate linear regressions in which the percentage decrease in RMSEs is 
the dependent variable. The explanatory variables are the return correlation and its square. The 
control variables are the number of original stocks (N) and the augmentation ratio (k). The estimated 
regression coefficients are shown in Table 7. The first model shows that the coefficient of the augmentation ratio is 
significantly positive, while the coefficient of N is significantly negative. These two variables have the 
expected signs. The second model adds return correlation as another explanatory variable. Its coefficient is 
not significant, which shows that the relationship between return correlations and the decreases in RMSEs is not linear. 
The quadratic term of return correlation is added in the third model. In this model, all coefficients are highly 
significant with the expected signs. This model confirms the non-linear relationship between the decreases in RMSEs 
and the return correlations of the original stocks; our data augmentation performs better in the regions of high 
and low correlation. 

Figure 4. Return correlation and decrease in RMSE 

Table 7. Return correlation and decrease in RMSE 

                                           Model
Variables                      I            II           III
# of original stocks (N)       -0.012***    -0.012***    -0.009***
Augmentation ratio (k)         0.041***     0.041***     0.041**
Return correlation                          2.001        -112.049***
Return correlation squared                               158.650***
Constant                       1.120***     0.402        20.657***
R^2                            0.177        0.180        0.206
# of observations              900          900          900


Significance levels: *p<0.1; **p<0.05; ***p<0.01 
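
As a worked illustration of the U-shape, the turning point implied by the third model can be computed from the reported coefficients. This assumes the return correlation enters the regression in raw units and that N and k are held fixed. Writing $\rho$ for the average return correlation, $\beta_1$ for its coefficient and $\beta_2$ for the coefficient of its square:

```latex
\begin{aligned}
\frac{\partial\,(\text{decrease in RMSE})}{\partial \rho}
  &= \beta_1 + 2\beta_2 \rho = 0 \\
\rho^{*} &= -\frac{\beta_1}{2\beta_2}
          = \frac{112.049}{2 \times 158.650} \approx 0.353
\end{aligned}
```

Since $\beta_2 > 0$, this point is a minimum: under these assumptions the fitted decrease in RMSE is smallest at an average correlation of about 0.35 and rises on either side of it.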


4. CONCLUSION 


This paper proposes a simple and effective data augmentation technique for stock return prediction. 
New synthetic stocks are generated from linear combinations of original stocks. Unlike techniques in the existing literature, 
our technique is intuitive and mimics real asset creation in financial markets. Our data augmentation 
significantly improves prediction in test sets. The larger the size of the augmented data, the larger the 
improvement. Moreover, regression analysis shows a U-shaped relationship between the return correlations of 
the original stocks and the prediction improvement from data augmentation. Our data augmentation works better 
for groups of original stocks with high or low correlation. 